

Using Computing and Data Grids 
for Large-Scale Science and Engineering 

William E. Johnston 1 

National Energy Research Scientific Computing Division, 

Lawrence Berkeley National Laboratory and 
Numerical Aerospace Simulation Division, NASA Ames Research Center 

Abstract 

We use the term “Grid” to refer to a software system that provides uniform and location 

and scalable 

Information Power Grid (“IPG”) and DOE’s Science Grid, and some of the scaling issues that 
have come up in their implementation. 

Keywords: Grids, distributed computing architecture, distributed resource management, security 

1 Introduction 

“Grids” (see [1]) are an approach for building dynamically constructed problem solving 
environments using geographically and organizationally dispersed high performance computing 
and data handling resources. 

• construction, management, and use of widely distributed application systems 
• facilitating human collaboration and remote access to, and operation of, scientific and 
engineering instrumentation systems 
• managing and securing this computing and data infrastructure 

be summarized as 


♦ information services 
♦ resource co-scheduling 
♦ data access 
♦ authentication and authorization 
♦ security services 
♦ auditing 
♦ monitoring 
♦ global event services 
♦ global queuing 
♦ data cataloguing 
♦ resource brokering 
♦ collaboration and remote instrument services 
♦ data location management 
♦ communication services 
♦ fault management 


1 wejohnston@lbl.gov, http://www.itg.lbl.gov; wej@nas.nasa.gov, http://www.ipg.nasa.gov 





The overall motivation for the current large-scale (multi-institutional) Grid projects is to enable 
the interactions that facilitate large-scale science and engineering such as aerospace 
systems design, high energy physics data analysis, climatology, large-scale remote instrument 
operation, etc. 

The vision for computing, data, and instrument Grids is that they will provide significant new 
capabilities to scientists and engineers by facilitating routine construction of information based 
problem solving environments that are built on-demand from large pools of resources. That is, 
Grids will routinely - and easily, from the user’s point of view - facilitate applications such as: 
• coupled, multidisciplinary simulations too large for single computing systems (e.g. 
multi-component turbomachine simulation - see [2]) 
• management of very large parameter space studies where thousands of low fidelity 
simulations explore, e.g., the aerodynamics of the next generation space shuttle in its many 
operating regimes (from Mach 27 at entry into the atmosphere to landing) 
• use of widely distributed, federated data archives (e.g., simultaneous access to meteorological, 
topological, aircraft performance, and flight path scheduling databases supporting a 
National Air Transportation Simulation system) 
• coupling large-scale computing and data systems to scientific and engineering instruments 
so that complex real-time data analysis results can be used by the experimentalist in ways 
that allow direct interaction with the experiment (e.g. Cosmology data analysis involving 
telescope and satellite interaction, and coupling to simulations) 
• single computational problems too large for any single system (e.g. extremely high 
resolution rotorcraft aerodynamic calculations) 

2 Requirements for Grids 

Analysis of several scientific application areas provides a broad set of requirements for Grids. 
Examples from large-scale engineering and large-scale science are presented, and requirements 
are derived. Many of the requirements overlap, especially for the toolbuilders, and these are 
presented only in the engineering section. 


2.1 Engineering Examples and Requirements 

The NASA Aerospace Engineering Systems arena provides both the perspective of the 
computational scientists who build computational design tools, and of the design engineer / 
analyst who must use those tools to accomplish a specific task ([3]). In summary, these 
requirements include: 

♦ Discipline analyst / problem solver requirements 

• multiple datasets maintained by discipline experts at different sites that support both 
geometric and computational design processes must be accessed and updated by many 
collaborating analysts 
• analysts must be able to securely share all aspects of their work process 
• techniques are needed for coupling heterogeneous computer codes, resources, and data 
sources in ways so that they can work on integrated/coupled problems in order to provide 
whole system simulations (“multi-disciplinary simulation/optimization”) 
• interfaces to data and computational tools must provide appropriate levels of abstraction for 
discipline problem solving 





• techniques must provide transparent and uniform 
control over all distributed resources participating in problem solving environments 
• workflow definition should be possible via “visual programming” scenarios that integrate ... 
• techniques are needed to describe and manage diverse strategies for parameter space ... 
• tools are needed to automatically manage and catalogue the numerous datasets that result from ... 
• ... managing generalized “faults” are required for all aspects of the working ... 
• ... architecture independent services must provide for various interprocess, 
interactive data-intensive, and multi-point communication 
• techniques are needed for debugging distributed software for correctness and performance 
• it must be possible to audit and account for use of all resources 
• co-allocation of resources to support coordinated use of multiple resources and scheduled 
use of resources must be available, and must accommodate “fuzzy” reservation (resource ... 
• ... be available for all resources including supporting 
construction of systems that have various “real-time” operating constraints 
• systems and operations professionals must be able to manage the distributed resources as 
part of a computing environment 
• resources should be “immune” to unauthorized access and manipulation 
• resource stakeholders/owners should have easily used mechanisms to enforce their use 
conditions and this must accommodate “fluid” work groups 
• the security and access control services must provide for easily specified characteristics and 
must be easily integrated into applications and problem solving ... 
• CPU resource queuing mechanisms must provide a general and ... 
• ... must be able to signal job actions and ... 
• use of CORBA, Java, Java/RMI, and DCOM must be provided within the context of the ... 
• tools are needed to manage distributed heterogeneous computing architectures 
• generalized resource discovery services are needed in order to provide readily available and 
detailed resource information 
• support for remote execution management should include automatic selection and 
installation of binaries and libraries appropriate for the target platform 

Many of these requirements are common to all of the science and engineering communities that 
will use Grids, and in the following two examples we only discuss additional requirements or 
different emphasis. 





2.2 Science Examples 

DOE’s High Energy and Nuclear Physics (HENP) data analysis paradigm ([4]) epitomizes the 
large-scale data analysis environment. Hundreds to thousands of physicists from hundreds of 
institutions in countries around the world must analyze terabytes of data generated at a single site 
(the accelerator and detector). The analysis results must then be recombined in a common 
database. Requirements of the HENP community in addition to those above derive from a data 
analysis environment that is dominated by managing the location of data to be analyzed. Data 
must be moved around global networks and cached in, e.g., regional data centers, and from there 
to investigator data centers, etc. Requirements for this environment in addition to those above 
include: 

• comprehensive network monitoring to locate, analyze, and correct bandwidth bottlenecks 
• data replica catalogues to provide global views of cached data 
• the methodology and implementation of incorporating, using, and managing resources in 
the overall environment must be scalable to thousands of resources 
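
The replica catalogue requirement above can be illustrated with a minimal sketch. This is not the HENP or Globus implementation; the class name `ReplicaCatalogue`, the site names, and the cost table are all hypothetical, and the "cost" is assumed to come from something like the network monitoring mentioned in the same list.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaCatalogue:
    """Maps a logical file name to the sites holding a copy (hypothetical model)."""
    replicas: dict = field(default_factory=dict)

    def register(self, logical_name, site):
        # Record that `site` holds a cached copy of `logical_name`.
        self.replicas.setdefault(logical_name, set()).add(site)

    def locate(self, logical_name):
        # Global view: every site known to hold the file, in a stable order.
        return sorted(self.replicas.get(logical_name, set()))

    def best_replica(self, logical_name, cost):
        # Pick the copy with the lowest access cost; None if no copy exists.
        sites = self.locate(logical_name)
        if not sites:
            return None
        return min(sites, key=lambda s: cost.get(s, float("inf")))


catalogue = ReplicaCatalogue()
catalogue.register("run42/events.dat", "accelerator-site")
catalogue.register("run42/events.dat", "regional-center")

# Hypothetical access costs as seen from one investigator's site.
cost_from_here = {"accelerator-site": 120.0, "regional-center": 15.0}
nearest = catalogue.best_replica("run42/events.dat", cost_from_here)
print(nearest)
```

Keeping the catalogue separate from the cost table mirrors the split in the list above between replica catalogues and network monitoring.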


Existing and planned cosmology observational programs in DOE’s observational and 
computational cosmology program (for example Cosmic Microwave Background measurements 
and the Supernova Cosmology Program ([5], [6])) are generating sufficiently large and complex 
data sets that they will require many researchers using the facilities and expertise of multiple large 
scientific computing centers if they are to be successful scientifically. 


These (and many other) widely distributed scientific collaboration environments involve several 
processes: 

• remote production and remote storage of data 

• community processing of data 

• community execution of simulations 

• management of multi-step data analysis and simulations where software components and 
data will use computing and storage resources at different sites 

• remote instrument interactions 


All of these processes incrementally contribute to an overall buildup of a scientific knowledge 
base, e.g. the accumulation of information from an on-going observational program that is 
managed as part of the collaboration. This management involves the additional element of being 
highly collaborative, with contributions coming from many sources, but with the requirement of an 
overall, or project, view of this information. 

These circumstances have led to specific requirements for workflow and collaboration 
frameworks beyond those noted above. The basic workflow framework must provide for: 
• describing and managing multi-step, asynchronous component workflows, including 
managing fault detection and recovery 
• access to data and metadata publication and subscription mechanisms 
• event mechanisms - e.g. notification of when data or simulation results come into existence 
anywhere in the space of resources of interest 
• user interfaces to each of the above 
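
As an illustration of the workflow requirements listed above, the following sketch runs a multi-step workflow in dependency order, emits an event when each result comes into existence, and recovers from a fault by retrying the step on an alternate resource. It is a simplification under stated assumptions: execution is synchronous for brevity (the requirement is for asynchronous workflows), and the function `run_workflow`, the step functions, and the site names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(steps, deps, resources, on_event=print):
    """Run steps in dependency order; on a fault, retry on the next resource."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        for resource in resources:
            try:
                results[name] = steps[name](resource, results)
                on_event(f"{name} completed on {resource}")  # event notification
                break
            except RuntimeError as fault:                    # fault detection
                on_event(f"{name} failed on {resource}: {fault}")
        else:
            raise RuntimeError(f"no resource could run step {name!r}")
    return results


def fetch(resource, results):
    return [1.0, 2.0, 3.0]


def simulate(resource, results):
    if resource == "site-a":          # simulated fault on the first resource
        raise RuntimeError("node unavailable")
    return [x * 2 for x in results["fetch"]]


def analyze(resource, results):
    return sum(results["simulate"])


events = []
out = run_workflow(
    {"fetch": fetch, "simulate": simulate, "analyze": analyze},
    {"fetch": set(), "simulate": {"fetch"}, "analyze": {"simulate"}},
    ["site-a", "site-b"],
    on_event=events.append,
)
```

The `on_event` callback stands in for the publication and subscription mechanisms in the list; a real framework would deliver these notifications to remote subscribers.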

Collaborative work support must address the human interaction aspects of collaborative data ... 

• maintenance of shared knowledge bases that allow a distributed community to create and 
update information about the state of overall progress of data processing, simulation results, 
existence of new, more highly refined, derived data, etc. 
• support for collaborative processing of data 
• support for on-line meetings, document sharing, and messaging 
• establishment and maintenance of the collaboration membership 
• security and management of access rights for the collaboration data and information 
environments. 

3 An Overall Model for Grids 

services such as security and system management are required at all levels. 

3.1 Problem Solving Environments: User Interfaces and Workflow Management 
The User Interface 

high throughput jobs, etc. Examples include SciRun [7], Ecce [8], and WebFlow [9]. 

Workflow Management 

Reliable operation of large and complex data analysis and simulation tasks requires methods for 
... in order to not only accommodate the routine processing, but to have sufficient elasticity to 
rapidly locate and configure alternate resources in the event of faults. 


Figure 1 A Grid Architecture - “upper” layers (top) and “lower” layers (bottom). Recoverable panel labels: Application Development and Execution Support Services and Systems; Grid Common Services: Standardized Services and Uniform Resource Interfaces; national user facilities. 


3.2 Programming Services 

Tools and techniques are needed for building applications that run in Grid environments. These 
cover a wide spectrum of programming paradigms, and must operate in multi-platform, 
heterogeneous computing environments. For example, Grid enabled MPI [10], Java bindings to 
..., CORBA [12] integrated with Grid services, Condor [13], Java/RMI [14], and perhaps 
..., are all application oriented middleware systems that will have to interoperate with 
the Grid services in order to gain access to the resources managed by the Grid. 




















3.3 Grid Common Services 

... run Grids is involved in maintaining these services. 

... provide secure authentication of users. 

Execution Management: 

Several services are critical to managing the execution ... questions like: how 
... is resource discovery and brokering. By ... set of properties; 
to find the set of objects (e.g. databases, CPUs, ... constraints such as allocation and 
how to select among many possible resources ... objects known 
scheduling; how to install a new object/service into ... relates to global views of 
as a Grid service. The second is execution queuing ... is distributed application 
CPU queues and their user-level management ... mechanisms for 
... supplying information to knowledge based recovery 
systems. 

... multicast and remote I/O. 

Uniform naming and location transparent ... 
objects, computations, instruments and networks. ... Secondary 
(e.g. read, write, seek) for all ... paging”) that are 
Storage, etc.) and richer access and I/O mechanisms (e.g. ... 
present in existing systems. 

High-speed, wide area, access to tertiary storage systems ... management services to 
access to data files, and the system must ... need the ability to 
location of local, remote and cached copies of files ... on off-line data. 

... and analysis, will be important for both users and the managers of the 
... control flow tracking in distributed, multi-process simulations. 





Environment Management: 

The key service that is used to manage the Grid environment is the “Grid Information Service.” 
This service - currently provided by Globus GIS (formerly MDS, see [20]) - maintains detailed 
characteristics and state information about all resources, and will also need to maintain dynamic 
... information about current process allocations and ... 

3.4 Resource Management for Co-Scheduling and Reservation 

One of the most challenging and well known Grid problems is that of scheduling scarce resources 
such as supercomputers and large instruments. In many, if not most, cases the problem is really 
one of co-scheduling multiple resources. Any solution to this problem must have the agility to 
support transient experiments based on systems built on-demand for limited periods of time. CPU 
advance reservation scheduling and network bandwidth advance reservation are critical 
components to the co-scheduling services. In addition, tape marshaling in tertiary storage systems 
... reservations of tertiary storage system off-line data and/or capacity is likely 
... functionality for co-scheduling and resource reservation usually must be 
provided by the individual resource managers. 
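
A minimal sketch of the co-scheduling problem described above: given the advance reservations already held by each resource manager, find the earliest time at which every resource is simultaneously free for the requested duration. The function name, the resource names, and the use of plain numbers for time are illustrative assumptions; real resource managers would expose reservation calendars through their own interfaces.

```python
def earliest_common_slot(busy_by_resource, duration, horizon):
    """Earliest start t with every resource free during [t, t + duration).

    busy_by_resource maps a resource name to a list of half-open (start, end)
    reservations. Returns None if no slot fits before `horizon`.
    """
    # The earliest feasible start is either time 0 or the end of some
    # existing reservation, so only those instants need to be checked.
    candidates = {0}
    for busy in busy_by_resource.values():
        candidates.update(end for _, end in busy)

    for start in sorted(candidates):
        end = start + duration
        if end > horizon:
            break
        free_everywhere = all(
            b_end <= start or b_start >= end
            for busy in busy_by_resource.values()
            for b_start, b_end in busy
        )
        if free_everywhere:
            return start
    return None


# Hypothetical reservations (hours from now) on three co-scheduled resources.
reservations = {
    "supercomputer": [(0, 4), (6, 8)],
    "network-bandwidth": [(3, 5)],
    "tape-robot": [(9, 10)],
}
two_hour_slot = earliest_common_slot(reservations, duration=2, horizon=24)
one_hour_slot = earliest_common_slot(reservations, duration=1, horizon=24)
```

In this toy data the one-hour request fits in the gap at hour 5, while the two-hour request has to wait until hour 10, after the tape reservation ends.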

3.5 Operations and System Administration 

Implementing a persistent, managed Grid requires tools for deploying and managing the system. 
In addition, tools for diagnostic analysis and distributed performance monitoring are 
required, as are accounting and auditing tools. Operational documentation and procedures are 
essential to managing the Grid as a robust production service. 

3.6 Access Control and Security 

The first requirement for establishing a workable authentication and security model for the Grid is 
to provide a single-sign-on authentication for all Grid resources based on cryptographic 
credentials that are maintained in the user’s desktop / PSE environment(s) or on one’s person. In 
addition, end-to-end encrypted communication channels are needed for many applications in 
order to ensure data integrity and confidentiality. This is provided by X.509 identity certificates 
(see [21]) together with the Globus security services. 

The second requirement is an authorization and access control model that provides for 
management of stakeholder rights (use-conditions) and trusted third parties to attest to 
corresponding user attributes. A policy-based access control mechanism that is based on 
... requirement. Several approaches are being investigated 
for providing these capabilities. 

3.7 Services for Operability 

To operate the Grid as a reliable, production environment is a challenging problem. Some of the 
identified issues include management tools for the Grid Information Service that provides global 
information about the configuration and state of the Grid; diagnostic tools so operations/systems 
staff can investigate remote problems; and tools and common interfaces for system and user 
administration, accounting, auditing and job tracking. Verification suites, benchmarks, regression 
analysis tools for performance, reliability, and system sensitivity testing are essential parts of 
standard maintenance. 

3.8 Grid Architecture: How do all these services fit together? 

... resources, and middleware that supports different styles of usage (e.g. different programming 
paradigms and access methods). 

... services to accomplish its function. 

Further, the “layered” model should not obscure the ... 

usually a collection ... is not rigid, and “drill down” (e.g. code written for ... 

... must be easily managed by the Grid services. 

4 NASA’s Information Power Grid 

For NASA, Grids such as IPG effectively represent a ... 

organizations delivering large-scale computing ... 

environment is to deliver these ... this service delivery 

distributed resources controlled, e.g., by ... management and maintenance of 

model requires two things: ... identified, built, 

integrated collections of widely distributed, multi-stakeholder resources ... 

and provided to the ... 

and operation of this new shared responsibility service 
delivery environment must be explicitly addressed. 

4.1 How is IPG Being Built 

A strategy for building multi-site Grids has evolved over the past two years at NASA, and this 
experience is summarized here. (Also see [19].) 

1) Establish an Engineering Working Group that involves the Grid deployment teams at each 
site. 

- schedule weekly meetings / telecons 
- if at all possible, involve a Globus consultant in these meetings 
- establish an Engineering Working Group email list that includes everyone involved in ... 

2) Identify the computing and storage resources to be incorporated into the Grid. 

3) Set up liaisons with the systems administrators for all systems that will be involved 
(computing and storage). 


5) Determine the model of operation for the Grid Information service (MDS) 

- decide on Netscape LDAP hierarchy (“classic model”) vs. Globus OpenLDAP model 
(almost certainly the Globus OpenLDAP approach for new Grids) 
- establish the GIS (resource) namespace (Look at other Grids and see what they have done, 
think carefully about the implications of your namespace and its relationship to the 
... might eventually participate - O=Grid might be a meaningful top ...) 
- plan on a GIS server at each distinct site 
- get the GIS operational 

6) Build and test the security infrastructure. 

- assuming a PKI based GSI, set up an X.509 Certification Authority 
- issue host certificates for the resources 
- count on revoking and re-issuing all of the certificates at least once before going operational 
- validate correct operation of the GSI, GSI ssh, GSI ftp, etc. 

7) Establish the conventions for the Globus mapfile 

- ... identities to system UIDs - this is the basic authorization mechanism for 
each individual platform - compute and storage 
- establish the connection between user accounts on individual platforms and requests for 
Globus access on those systems (initially a non-intrusive mechanism such as email to the 
responsible sys admin to modify the mapfile is best) 

8) Validate network connectivity between the sites and investigate firewall issues. 

- Globus can be configured to use a restricted range of ports, but it still needs a handful of 
open ports 
- GIS/MDS also needs some open ports 

9) Establish a user help mechanism. 

- Grid user email list 
- if possible, a trouble ticket system 
- Web pages with pointers to documentation and examples 
- a Globus “Quick Start Guide” that is modified to be specific to your Grid, with examples 
that will work in your environment 

At this point Globus, the GIS/MDS, and the security infrastructure should all be operational on 
the testbed system(s). The Grid deployment team should be familiar with the install and operation 
issues, and the sys admins of the target resources should be engaged. 
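
The mapfile step above (mapping Grid identities to system UIDs as the basic per-platform authorization mechanism) can be sketched as follows. The layout assumed here, a quoted certificate subject followed by a local account name on each line, follows the commonly seen Globus mapfile convention, but the exact syntax should be treated as an assumption; the subjects, account names, and function names are hypothetical.

```python
import shlex

# Hypothetical mapfile contents: quoted X.509 subject, then local account.
SAMPLE_MAPFILE = '''
# Grid identity -> local account
"/O=Grid/OU=Example Lab/CN=Jane Doe" jdoe
"/O=Grid/OU=Example Lab/CN=Raj Patel" rpatel
'''


def parse_mapfile(text):
    """Return a dict mapping certificate subject names to local accounts."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        subject, account = shlex.split(line)
        mapping[subject] = account
    return mapping


def authorize(mapping, subject):
    """Return the local account for an authenticated subject, or None."""
    return mapping.get(subject)


mapping = parse_mapfile(SAMPLE_MAPFILE)
print(authorize(mapping, "/O=Grid/OU=Example Lab/CN=Jane Doe"))
```

Because authorization reduces to the presence of an entry, the initial "email the sys admin" procedure suggested above amounts to asking for one line to be added per user per platform.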

Next step is to build a prototype-production environment. 

10) Deploy and build Globus on at least two computing platforms at two different sites. 

11) Establish GIS servers at each major site 

12) Establish the relationship between Globus job submission and the local batch schedulers 
(one queue, several queues, a Globus queue, etc.) 

13) Validate operation of this configuration. 


14) ... (from user system to user job on computing platform, and back) 

- validate that all of these data paths work correctly 

15) ... serve as the interface between users and the Globus system administrators 

16) Identify early users and have the Globus application specialists assist them in getting 
applications running on the Grid. 

17) Decide on a Grid job tracking and monitoring strategy. 

18) Put up one of the various Web portals for Grid resource monitoring. 


4.2 What is the State of IPG? 

... NASA Centers around the country in order that large-scale problems ... 
than is possible today. 

• ... SGI Origin 2000s at three NASA sites 
• a large Condor pool (about ...) integrated with Globus 
• several Terabytes of uniformly accessible mass storage 
• wide area network interconnects of at least 100 mbit/s 
• a stable and supported operational environment 

• ... computing and storage resources for inclusion in IPG 
• deployment of Globus ... monitoring throughout IPG 
• reliable, distributed Grid Information Service that ... 
• ... integration and deployment to support single sign-on 
using X.509 cryptographic identity certificates (see [21]) 


Figure 2 First Phase of NASA’s Information Power Grid (NASA’s IPG Baseline System and High Data-Rate (DX) Testbed). 

• network infrastructure and QoS 
• tertiary storage system metadata catalogue and uniform access system 
• operational and system administration procedures for the distributed IPG 
• user and operations documentation 
• ... management across ... with multiple stakeholders 
• Grid MPI [10] and CORBA programming middleware systems integration 
• high throughput job management tools 
• distributed debugging and performance monitoring tools 

4.3 What Types of Applications are Being Run on IPG? 

4.3.1 Data Mining 

The University of Alabama in Huntsville has developed a data mining system called ADaM 
(Algorithm Development and Mining). The current design consists of a mining engine and a 
daemon-controlled database. The database contains information about the data to be mined, 
including its type and its location. To mine for data, the user provides the mining engine with a 
mining plan that consists of the sequential list of mining operations that are to be performed along 
with any parameters that may be required for each mining operation. The mining engine consults 
the database in order to find out where the data to be mined is stored and then applies the mining 
plan to the set of data that has been identified to the database. Each mining operation is 
represented as a shared-library file, one file per operation. 

located on the user's workstation. 

Using globusrun, the user is able to stage the mining engine to another processor to execute. As 
required, the mining engine will acquire mining operation executables in the form of 
shared-library files from an operator repository on the IPG. Since a single mining plan may 
involve only a handful of operators (out of the 70+ operations that ADaM currently supports), 
this means that only the required mining operators need to be sent to the IPG node that is 
currently supporting the mining engine. This is accomplished using the Globus data 
transfer functions. 
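
The on-demand acquisition of mining operators described above can be sketched as a small plan interpreter. This is not ADaM's code: the class `MiningEngine`, the operator names, and the use of an in-memory dictionary in place of the operator repository and shared-library loading are all illustrative assumptions.

```python
class MiningEngine:
    """Applies a mining plan, fetching only the operators the plan uses."""

    def __init__(self, repository):
        # `repository` maps operator name -> callable; it stands in for the
        # remote operator repository of shared-library files.
        self.repository = repository
        self.loaded = {}

    def _operator(self, name):
        if name not in self.loaded:
            # Stands in for a data transfer followed by a dynamic load.
            self.loaded[name] = self.repository[name]
        return self.loaded[name]

    def run(self, plan, data):
        # A plan is a sequential list of (operator name, parameters) pairs.
        for name, params in plan:
            data = self._operator(name)(data, **params)
        return data


# Hypothetical operator repository with three of the many available operators.
repository = {
    "threshold": lambda data, minimum: [x for x in data if x >= minimum],
    "scale": lambda data, factor: [x * factor for x in data],
    "smooth": lambda data: data,
}

engine = MiningEngine(repository)
plan = [("threshold", {"minimum": 5}), ("scale", {"factor": 2})]
mined = engine.run(plan, [1, 5, 9])
```

After the run, `engine.loaded` holds only the two operators named in the plan, which is the point of shipping operators individually rather than as one bundle.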

As it executes, the mining engine stages the data to be mined from the data repository to the 
processor where the mining engine is executing. There are currently several sites that act as 
data repositories, and which currently pull data from NASA's Global Hydrology and Climate 
Center (which ... an FTP directory accessible through the web) so that it can 
be mined for severe storms. 

Figure 3 512 node SGI Origin at NASA Ames uses IPG uniform interface data access tools to 
simultaneously mine hydrology data from four sites. 


This is work of Tom Hinke (thinke@mail.arc.nasa.gov). 


4.3.2 Parameter Studies 

ILab is an aerospace parameter study ... manage compute resources for ... This is work 
of Maurice Yarrow (yarrow@nas.nasa.gov), NASA Ames. 


... computational inefficiency of Java. See [25] and [26]. This is work of Al Globus 
(globus@nas.nasa.gov), NASA Ames. 


4.3.3 High Latency Algorithm R&D 

IPG provides services for aggregating computing resources in a parallel and distributed fashion. 
One future application of this will be single simulations that operate across many, widely 
distributed systems. Current algorithms, however, do not accommodate the high and variable 
latencies encountered in such a computing environment. The research branch of the NAS 
Division is investigating algorithms that are suitable for Grid computing environments. One 
candidate is overset grid codes that can tolerate timestep mis-matches on the intra-object 
boundaries. A version of the OVERFLOW Navier-Stokes CFD simulation code ... this approach. 
It has been demonstrated operating across systems at ARC, ... flow ... large test objects 
mounted in a wind tunnel. This is work of Mohammad J. Djomehri (djomehri@nas.nasa.gov). 



Figure 4 The ILab, aerospace parameter study system: A Grid based Problem Solving Environment. 

Figure 5 Experiments in latency tolerant algorithms (high-lift subsonic wind tunnel model). 



5 Scaling Issues for Grids 

Experience with NASA’s IPG, DOE’s High Energy Physics Grids, DOE’s Energy Sciences 
Network’s large-scale directory services, and the DOE2000 Collaboratory program’s experience 
with security and PKI, can be used to identify a list of issues that, apart from the basic ..., 
will also have to be addressed in order to build large-scale Grids for 
science. These include: 

• Directory services 
- naming to support many virtual organizations that may have some resources ... 
- scaling the directories themselves to support searches of thousands of resources across ... 
- support ... services (e.g. data replica catalogues) within the Science 
Grid directory infrastructure and namespace 

• Security 
- “root” X.509 identity certificate Certification Authority (to sign the institutional CA 
certificates in order to facilitate cross-institutional identities) 
- Grid-wide Certificate Revocation List “repository” 

• User ... 
- a trouble ticket system for collecting, codifying, and correcting problems 
- infrastructure-wide monitoring to identify performance bottlenecks in various resources 
(networks, computing, and storage) 

• Integration with large tertiary storage systems 
- interface with local MSS groups to provide GridFTP [11] 

• Integration with large numbers of computing resources 
- work with local sys admins to get Globus server side software and authorization files 
installed and debugged on local systems 

• Resource ... 
- provide tools and repositories for standard resource accounting records that can be used 
for project-level resource allocation management across the Grid 

• Test and validation suites 
- provide standard Grid resource configuration validation test suites and facilitate the 
routine use and results analysis for these suites across the Grid 

Work is being done in all of these areas; however, in this paper we will look in detail only at 
information services. 

5.1 The Grid Information Service 

Grids will be global infrastructure, and will depend heavily on the ability to locate information 
about computing, data, and human resources for particular purposes, and within particular 
contexts. Further, most Grids will serve virtual organizations whose members are affiliated by a 
common administrative parent (e.g. the DOE Science Grid and NASA’s Information Power Grid), 
common long-lived project (e.g. the High Energy Physics Atlas experiment), common funding 
source, etc. 


The user/functional requirements and operational requirements for a Grid Information Service to 
satisfy these needs present a substantial problem in scaling the current LDAP based approaches. 

(This work is done in collaboration with Mike Helm (helm@fionn.es.net), Lawrence Berkeley 
National Lab.) 

5.1.1 User Requirements 
Searching 

The basic sort of question that a GIS must be able to answer is: for all resources in a virtual 
organization, provide a list of those with specific characteristics. 

For example: 

“Within the scope of the Atlas collaboration, return a list of all Sun systems with at least 2 

CPUs and 1 gigabyte of memory, that are running Solaris 2.6 or Solaris 2.7, and for which I 
have an allocation.” 

Answering this question involves examining both the virtual organization attribute and the 
resource attributes in order to produce a list of candidates. 
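
The example query can be written out as a filter over resource records, which makes the two kinds of attribute explicit. This is a sketch only: the record layout, field names, and sample resources are hypothetical, and a deployed GIS would evaluate such a query against its directory rather than an in-memory list.

```python
def find_candidates(resources, virtual_org, user):
    """Return names of resources that satisfy the example query for `user`."""
    wanted_os = {"Solaris 2.6", "Solaris 2.7"}
    return [
        r["name"]
        for r in resources
        if virtual_org in r["virtual_orgs"]   # virtual organization attribute
        and r["vendor"] == "Sun"              # resource attributes
        and r["cpus"] >= 2
        and r["memory_gb"] >= 1
        and r["os"] in wanted_os
        and user in r["allocations"]          # "for which I have an allocation"
    ]


# Hypothetical resource records.
resources = [
    {"name": "sun-a", "virtual_orgs": {"atlas"}, "vendor": "Sun", "cpus": 4,
     "memory_gb": 2, "os": "Solaris 2.7", "allocations": {"alice"}},
    {"name": "sun-b", "virtual_orgs": {"atlas"}, "vendor": "Sun", "cpus": 1,
     "memory_gb": 4, "os": "Solaris 2.6", "allocations": {"alice"}},
    {"name": "sun-c", "virtual_orgs": {"climate"}, "vendor": "Sun", "cpus": 8,
     "memory_gb": 8, "os": "Solaris 2.6", "allocations": {"alice"}},
    {"name": "sgi-a", "virtual_orgs": {"atlas"}, "vendor": "SGI", "cpus": 512,
     "memory_gb": 192, "os": "IRIX 6.5", "allocations": {"alice", "bob"}},
]

candidates = find_candidates(resources, "atlas", "alice")
print(candidates)
```

Here `sun-b` fails on CPU count, `sun-c` on virtual organization, and `sgi-a` on vendor and operating system, leaving a single candidate.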

Virtual Organizations 

It should be possible to provide “roots” for virtual organizations. These nodes provide search 
scoping by establishing roots that sit at the top of a hierarchy of virtual org. resources, and 
therefore starting places for searches. Like other named objects in the Grid, these virtual org. 
nodes might have characteristics specified by attributes and values. In particular, the virtual 
organization node probably needs a name reflecting the org. name; however, some names (e.g. for 
resources) may be inherited from the Internet DNS domain names. Virtual organizations may find 
it convenient to register with a Grid “root” so that they can share resources if policy allows. This 
is addressed by the layers in Figure 6 labeled “root,” “virtual org,” and “resources.” 
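
Search scoping by virtual organization roots can be sketched with a small tree in which a search visits only the subtree under its starting node. The `Node` class, the node names, and the attribute names are hypothetical; a real deployment would express the same idea as a directory search rooted at the virtual organization's entry.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A named object in the information hierarchy (hypothetical model)."""
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add(self, child):
        self.children.append(child)
        return child


def search(root, predicate):
    """Depth-first search scoped to the subtree under `root`."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if predicate(node):
            found.append(node.name)
        stack.extend(reversed(node.children))  # keep left-to-right order
    return found


# A Grid root with two virtual organization roots beneath it.
grid = Node("o=Grid")
atlas = grid.add(Node("vo=atlas"))
climate = grid.add(Node("vo=climate"))
atlas.add(Node("host-a", {"cpus": 4}))
atlas.add(Node("host-b", {"cpus": 1}))
climate.add(Node("host-c", {"cpus": 8}))


def has_two_cpus(node):
    return node.attributes.get("cpus", 0) >= 2


within_atlas = search(atlas, has_two_cpus)
grid_wide = search(grid, has_two_cpus)
```

Starting at the virtual org. node limits the answer to that organization's resources, while starting at the Grid root reaches resources shared across organizations, subject to whatever policy the servers enforce.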

Information and Data Objects 

A variety of other information will probably require cataloguing and global access, and the GIS 

should accommodate this in order to minimize the number of long-lived servers that have to be 
managed: 

• dataset metadata 

• dataset replica information 

• database registries 

• Grid system and state monitoring objects 

• Grid entity certification/registration authorities (e.g. X.509 Certificate Authorities) 

• Grid Information Services object schema 

Therefore it should be possible to create arbitrary nodes to represent other types of information, 
such as information object hierarchies. 

This sort of information has to be consistently named in a global context, will have to be 
locatable, and in some cases will have an inherently hierarchical structure. 

Requirements for these catalogues include: 

• providing unique and consistent object naming 

• access control 


• searching, discovery, and publish/subscribe 


5.1.2 Operational Requirements 
Performance and Reliability 

♦ Queries, especially local queries, should be satisfied in times that are comparable to other 
queries like uncached DNS data. E.g., seconds or fractions of seconds. 

« Local sites should not be dependent on remote servers to locate and search local resources. 

♦ It should be possible to restrict searches to local resources of a single, local, administrative 
domain. 

♦ Site administrative domains may wish to restrict access to local information, and therefore will 
want control over a local, or set of local, information servers. 

These imply the need for servers intermediate between local resources and the virtual org. root
that are under local control for security, performance management, and reliability management.
This is addressed by the layer labeled “local control” in Figure 6.
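
One way to read these requirements is that a site's own server answers local queries from its own records and contacts a superior server only when a wider scope is explicitly requested. The sketch below illustrates that reading under stated assumptions; LocalServer, UpstreamServer, the scope parameter, and the record fields are hypothetical names, not part of Globus or of any GIS interface.

```python
class UpstreamServer:
    """Stand-in for a remote (e.g. virtual org. level) information server."""

    def __init__(self, records):
        self.records = list(records)
        self.calls = 0  # counts how often the remote server is contacted

    def query(self, predicate):
        self.calls += 1
        return [r for r in self.records if predicate(r)]


class LocalServer:
    """Site-controlled server that can answer local queries on its own."""

    def __init__(self, records, upstream):
        self.records = list(records)
        self.upstream = upstream

    def query(self, predicate, scope="local"):
        hits = [r for r in self.records if predicate(r)]
        if scope == "global":
            # Only a deliberately wider search touches the remote server.
            hits.extend(self.upstream.query(predicate))
        return hits


def is_compute(record):
    return record["type"] == "compute"


# Hypothetical records; the host names are placeholders.
upstream = UpstreamServer([{"name": "far1.example.net", "type": "compute"}])
site = LocalServer([{"name": "host1.example.org", "type": "compute"}], upstream)

local_hits = site.query(is_compute)
remote_calls_after_local = upstream.calls
global_hits = site.query(is_compute, scope="global")
```

In this sketch a local search succeeds even if the upstream server is unreachable, because the default scope never calls it.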

(Note that in Globus terminology these intermediate directory servers are called GIISs.)

Multiple Membership

Many objects/resources will have membership in multiple virtual organizations. This information,
like other resource attributes, will likely be maintained at the resources in order to minimize
management tasks at the upper level nodes.

♦ It must be possible for a resource to register with multiple virtual organizations (note the
multiply connected resources in Figure 6).
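
As an illustration of membership being kept at the resource, the following sketch has each resource carry its own list of virtual organizations and register itself with every one of them. The VirtualOrg and Resource classes, and the names vo-a, vo-b, and host1.example.org, are hypothetical; this is one possible arrangement, not a description of an existing service.

```python
class VirtualOrg:
    """Upper-level node that only records what resources tell it."""

    def __init__(self, name):
        self.name = name
        self.members = {}  # resource name -> copy of its attributes

    def accept(self, resource):
        self.members[resource.name] = dict(resource.attributes)


class Resource:
    """Resource that owns its membership list and its attributes."""

    def __init__(self, name, memberships, **attributes):
        self.name = name
        self.memberships = list(memberships)  # maintained at the resource
        self.attributes = attributes

    def register_all(self, orgs_by_name):
        """Register with every virtual org. named in this resource's own list."""
        for org_name in self.memberships:
            orgs_by_name[org_name].accept(self)


orgs = {name: VirtualOrg(name) for name in ("vo-a", "vo-b")}
shared = Resource("host1.example.org", ["vo-a", "vo-b"], type="compute")
shared.register_all(orgs)
```

Changing the membership list at the resource and registering again is then the only manual step; nothing is edited by hand at the virtual org. nodes.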

Minimal Manual Management

The management of the information servers above the resources (in the case of a resource
catalogue) must be as automatic/minimal as possible.

♦ Information about a resource should be maintained at that resource, and should propagate
automatically to superior information servers.
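
One common way to obtain automatic propagation with little manual management is soft-state registration: a resource periodically re-announces itself, each server forwards the announcement upward, and entries that stop being refreshed simply expire. The text above does not prescribe this mechanism, so the sketch below is an assumption; InformationServer, its ttl_seconds parameter, and the explicit now argument (used to keep the example deterministic) are hypothetical.

```python
class InformationServer:
    """Server that keeps entries only while they keep being refreshed."""

    def __init__(self, ttl_seconds, superior=None):
        self.ttl_seconds = ttl_seconds
        self.superior = superior  # next server up, if any
        self.entries = {}  # name -> (attributes, time last refreshed)

    def refresh(self, name, attributes, now):
        """Record an announcement and pass it on to the superior server."""
        self.entries[name] = (dict(attributes), now)
        if self.superior is not None:
            self.superior.refresh(name, attributes, now)

    def live(self, now):
        """Return the entries whose last refresh is within the time-to-live."""
        return {
            name: attrs
            for name, (attrs, seen) in self.entries.items()
            if now - seen <= self.ttl_seconds
        }


vo_server = InformationServer(ttl_seconds=60)
site_server = InformationServer(ttl_seconds=60, superior=vo_server)

# The resource announces itself once at time 0 and then goes silent.
site_server.refresh("host1.example.org", {"type": "compute"}, now=0)
```

With this scheme a resource that disappears needs no cleanup at the upper levels; its entry ages out once the time-to-live has passed.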



Figure 6. Structure of a Grid Information Space.





Control over Information Propagation 

At each level of information management (four have emerged so far) there are various reasons 
why both import and export controls will have to be established. 

♦ At the object / resource level (see Figure 7), the local administrators must have control over 
what information is exported for the purposes of registration. 



♦ At the object / resource level there must be access control mechanisms to restrict the types of 
queries or the detail that queries return. 

♦ The nodes at the level of “local control” are meant to model a common system administration 
domain, and must support a common security policy, including who is allowed to register 
(import control) and what information is passed outside of the security domain (export 
control). It should, e.g., be possible to implement policies such as making anonymous the 
information that is passed to the next level up (either for registration or as search results). 

♦ Such anonymous information should allow broad searches at the upper levels, but limit 
specific searches to the lower levels, where searches can be authorized based on the 
relationship of the searcher to the resource. 


♦ The same sorts of capabilities as exist at the local control level must be available at the virtual
organization level in order to maintain control over the characteristics of the virtual
organization.

♦ At the root, again it must be possible to apply policy to registration (e.g. to prevent nodes 
below the virtual org. level from registering at the root). 

♦ The ability to do automatic node replication for reliability will exist at all levels. 

♦ Information import and export must be automatic and the content subject to management. See 
the “import/export” nodes in Figure 7. 
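
The import and export controls listed above can be sketched as two small policy functions applied at a “local control” node. This is an illustrative sketch under stated assumptions: the function names may_register and export_view, the rule names "pass", "anonymize", and "drop", and the record fields are all hypothetical, and replacing a value with "*" is only one possible way of making information anonymous.

```python
def may_register(name, allowed_suffixes):
    """Import control: accept registrations only from names under allowed suffixes."""
    return any(name.endswith(suffix) for suffix in allowed_suffixes)


def export_view(record, policy):
    """Export control: return the copy of `record` that may leave the domain."""
    exported = {}
    for key, value in record.items():
        rule = policy.get(key, "drop")  # unlisted attributes never leave the domain
        if rule == "pass":
            exported[key] = value
        elif rule == "anonymize":
            exported[key] = "*"  # keep the attribute, hide its value
    return exported


# Hypothetical record held at the local control level.
record = {"name": "host1.example.org", "type": "compute", "cpus": 64, "owner": "jdoe"}
policy = {"name": "anonymize", "type": "pass", "cpus": "pass"}

outside = export_view(record, policy)
```

A search at an upper level can still find that some 64-CPU compute resource exists behind this node, but identifying it requires a query at the lower level, where the searcher can be authorized.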


Performance and Robustness 

Finally, when we consider the information flow implied by the structure in Figure 7, it is apparent
that a lot of information may flow in complex patterns. Well-tested components and procedures
will be needed. We will probably need some modeling or measurement information on the
volume and rate of data flow in such an environment in order to assess the scaling issues.
Warren Smith (wwsmith@nas.nasa.gov), NASA Ames, is working on this problem.
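
A first, very rough estimate of the update traffic can be made by assuming that every resource re-sends its information at a fixed interval and that each update is forwarded through every level above it. The sketch below encodes only that assumption; the function update_load and all of the numbers in the example are hypothetical and are not measurements from IPG or any other Grid.

```python
def update_load(num_resources, refresh_seconds, bytes_per_update, levels):
    """Estimate steady-state update traffic summed over all levels.

    Assumes each resource sends one update per refresh interval and that
    the update is forwarded once at each of `levels` levels above it.
    """
    messages_per_second = num_resources * levels / refresh_seconds
    bytes_per_second = messages_per_second * bytes_per_update
    return messages_per_second, bytes_per_second


# Hypothetical inputs: 10,000 resources, 5-minute refresh, 2,000-byte updates,
# three levels (local control, virtual org., root).
messages, volume = update_load(10_000, 300, 2_000, 3)
```

Even this simple model shows that the load grows linearly with the number of resources and inversely with the refresh interval, which is why measured rates would be needed before drawing conclusions about scaling.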


6 Conclusion 

There are many challenges in making Grids a reality, in the sense that they can provide new 
capabilities in production quality environments. 

While the basic Grid services have been demonstrated, e.g. in the IPG prototype demonstrated at
the Supercomputing 1999 conference, and the GUSTO testbed ([22]) demonstrated in 1998, a
general purpose computing, data management, and real-time instrumentation Grid involves many
more services. One challenge is to identify the minimal set of such services and another is to scale
the services to Grids that knit together resources at hundreds of sites.

In the case of the analysis of the information services scaling issues described above, Mike Helm
of DOE’s Energy Sciences Network is building on ESNet’s long experience in very large
X.500 services to devise solutions for the issues noted above. (The solution will likely involve
careful name space definition and management, and use of commercial meta-directory systems to
off-load the GIS LDAP servers, etc. A paper on this will be available at
http://www.lbl.gov/~mike/globus.) And Warren Smith of NASA Ames is building a performance
modeling and simulation system for the GIS.

In addition to the NASA and DOE projects, Grids are being developed by a substantial and 
increasing community of people who work together in a loosely bound coordinating organization 
called the Grid Forum (www.gridforum.org - [23]).


7 Acknowledgements 

Almost everyone in the NAS Division of the NASA Ames Research Center, numerous other
people at the NASA Ames, Glenn, and Langley Research Centers, as well as many people
involved with the NSF PACIs (especially Ian Foster, Argonne National Lab, Carl Kesselman,
USC/ISI, Randy Butler, NCSA, and Reagan Moore, SDSC) have contributed to IPG. Bill
Feiereisen, NAS Division Chief, provided the initial vision for IPG and strong support for its
development and deployment. Bill Nitzberg (now with Veridian Systems), Dennis Gannon
(Indiana University), Leigh Ann Tanner, and Arsi Vaziri of NASA Ames have made special
contributions to IPG. Tom Hinke, Maurice Yarrow, Al Globus, Mohammad J. Djomehri, and
Warren Smith of NASA Ames, and Mike Helm, of DOE’s ESNet, are all acknowledged in the
text.

IPG is funded primarily by NASA’s Aero-Space Enterprise, Information Technology (IT)
program (http://www.nas.nasa.gov/IT/overview.html). DOE’s Science Grid is funded by the U.S.
Dept. of Energy, Office of Science, Office of Advanced Scientific Computing Research,
Mathematical, Information, and Computational Sciences Division
(http://www.sc.doe.gov/production/octr/mics) under contract DE-AC03-76SF00098 with the
University of California.

8 References 

[1] Foster, I., and C. Kesselman, eds., The Grid: Blueprint for a New Computing Infrastructure.
Morgan Kaufmann, Pub. August 1998. ISBN 1-55860-475-8.
http://www.mkp.com/books_catalog/1-55860-475-8.asp

[2] Numerical Propulsion System Simulation (NPSS) - see
http://hpcc.lerc.nasa.gov/npssintro.shtml

[3] “Information Power Grid.” See http://www.ipg.nasa.gov for project information. The 
implementation plan, including a requirements analysis section, is located at 
http://www.ipg.nasa.gov/Engineering/requirements.html . 

[4] See, e.g., http://www.cacr.caltech.edu/ppdg/ 

[5] “Science Magazine Names Supernova Cosmology Project ‘Breakthrough of the Year’,”
LBNL Research News, http://www.lbl.gov/supernova/

[6] “High Redshift Supernova Search Home Page of the Supernova Cosmology Project.” See
http://www-supernova.lbl.gov/

[7] SCIRun is a scientific programming environment that allows the interactive construction, 
debugging and steering of large-scale scientific computations. 
http://www.cs.utah.edu/~sci/software/ 

[8] Ecce - www.emsl.pnl.gov 

[9] WebFlow - A prototype visual graph based dataflow environment, WebFlow, uses the mesh 
of Java Web Servers as a control and coordination middleware, WebVM: See 
http://iwt.npac.syr.edu/projects/webflow/index.htm 

[10] Foster, I., N. Karonis, “A Grid-Enabled MPI: Message Passing in Heterogeneous
Distributed Computing Systems.” Proc. 1998 SC Conference. Available at
http://www-fp.globus.org/documentation/papers.html

[11] See http://www.globus.org/datagrid/

[12] Otte, R., P. Patrick, M. Roy, Understanding CORBA, Englewood Cliffs, NJ, Prentice Hall,
1996.

[13] Livny, M, et al, “Condor.” See http://www.cs.wisc.edu/condor/ 

[14] Sun Microsystems, “Remote Method Invocation (RMI).” See
http://developer.java.sun.com/developer/technicalArticles//RMI/index.html .

[15] Grimshaw, A. S., W. A. Wulf, and the Legion team, “The Legion vision of a worldwide
virtual computer”, Communications of the ACM, 40(1):39-45, 1997.





[16] Microsoft Corp., “DCOM Technical Overview.” November 1996. See
http://msdn.microsoft.com/library/backgrnd/html/msdn_dcomtec.htm .

[17] See, e.g., http://www.nas.nasa.gov/Main/Features/2001/Winter/launchpad.html 

[18] Foster, I., C. Kesselman, “Globus: A metacomputing infrastructure toolkit”, Int'l J.
Supercomputing Applications, 11(2):115-128, 1997. (Also see http://www.globus.org)

[19] “Hitchhiker’s Guide to the Grid,” http://www.ipg.nasa.gov/ -> User Support ->
Documentation

[20] Fitzgerald, S., I. Foster, C. Kesselman, G. von Laszewski, W. Smith, S. Tuecke, “A
Directory Service for Configuring High-Performance Distributed Computations.” Proc. 6th
IEEE Symp. on High-Performance Distributed Computing, pg. 365-375, 1997. Available
from http://www.globus.org/documentation/papers.html .

[21] Public-Key certificate infrastructure (“PKI”) provides the tools to create and manage
digitally signed certificates. For more information, see, e.g., RSA Lab’s “Frequently Asked
Questions About Today's Cryptography,” http://www.rsa.com/rsalabs/faq/; Computer
Communications Security: Principles, Standards, Protocols, and Techniques, W. Ford,
Prentice-Hall, Englewood Cliffs, New Jersey, 07632, 1995; or Applied Cryptography, B.
Schneier, John Wiley & Sons, 1996.

[22] “Globus Ubiquitous Supercomputing Testbed Organization” (GUSTO). At Supercomputing
1998 GUSTO linked around 40 sites, and provided over 2.5 TFLOPS of compute power,
thereby representing one of the largest computational environments ever constructed at that
time. See http://www.globus.org/testbeds .

[23] Grid Forum. The Grid Forum (www.gridforum.org) is an informal consortium of institutions
and individuals working on wide area computing and computational grids: the technologies
that underlie such activities as the NCSA Alliance's National Technology Grid, NPACI's
Metasystems efforts, NASA’s Information Power Grid, DOE ASCI’s DISCOM program,
and other activities worldwide.

[24] Maurice Yarrow, Karen M. McCann, Rupak Biswas, and Rob F. Van der Wijngaart, “An
Advanced User Interface Approach for Complex Parameter Study Process Specification on
the Information Power Grid.”
http://www.nas.nasa.gov/Research/Reports/Techreports/2000/nas-00-009-abstract.html

[25] Al Globus, Sean Atsatt, John Lawton, and Todd Wipke, “JavaGenes: Evolving Graphs with
Crossover,” 2000. http://www.nas.nasa.gov/~globus/papers/JavaGenes/paper.html

[26] Al Globus, Eric Langhirt, Miron Livny, Ravishankar Ramamurthy, Marvin Solomon,
and Steve Traugott, “JavaGenes and Condor: Cycle-Scavenging Genetic Algorithms,” Java
Grande 2000, sponsored by ACM SIGPLAN, San Francisco, California, 3-4 June 2000.
http://www.nas.nasa.gov/~globus/papers/JavaGrande2000/JavaGrandePaper.html





